Say-as classification for alphabetic words in Japanese texts
Abstract
Modern Japanese texts often include Western-sourced words written in the Roman alphabet. For example, a shopping directory in a web portal, which lists more than 8,000 shops, includes a total of 6,400 alphabetic words. As most of them are very new and idiosyncratic proper nouns, it is impractical to assume that all those alphabetic words can be registered in the word dictionary of a text-to-speech synthesis system; their pronunciations must be derived automatically. Our solution consists of two steps. Step 1 classifies each unknown alphabetic word into a say-as class (English, Japanese, French, Italian or English spell-out), which indicates how it is to be read, and Step 2 derives the pronunciation using the grapheme-to-phoneme conversion rules for the assigned say-as class. This paper proposes a method of say-as classification (i.e. Step 1) that uses a Support Vector Machine. After some trial and error, we achieved 89.2% accuracy on web shop data, which we consider sufficient for practical use.
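The two-step structure described in the abstract can be sketched as follows. This is only an illustration of the control flow, not the authors' implementation: the function names, the letter-name table, and the per-class converters are hypothetical, and `toy_classifier` is a trivial heuristic standing in for the trained Support Vector Machine, whose features and training setup are not given here.

```python
from typing import Callable, Dict, List, Tuple

# The five say-as classes named in the abstract.
SAY_AS_CLASSES = ("english", "japanese", "french", "italian", "spell_out")

# Toy letter names for the spell-out class (illustrative, not a real lexicon).
LETTER_NAMES = {"J": "jei", "N": "enu", "R": "aaru", "T": "tii"}


def spell_out(word: str) -> List[str]:
    """Read a word letter by letter, as for an acronym."""
    return [LETTER_NAMES.get(ch.upper(), ch.lower()) for ch in word]


def placeholder_g2p(word: str) -> List[str]:
    """Stand-in for a language-specific rule set: one symbol per letter."""
    return list(word.lower())


# Step 2 lookup: one grapheme-to-phoneme converter per say-as class.
G2P_BY_CLASS: Dict[str, Callable[[str], List[str]]] = {
    cls: placeholder_g2p for cls in SAY_AS_CLASSES
}
G2P_BY_CLASS["spell_out"] = spell_out


def toy_classifier(word: str) -> str:
    """Hypothetical stand-in for the trained SVM of Step 1."""
    if not any(ch in "aeiou" for ch in word.lower()):
        return "spell_out"  # no vowels: treat as an acronym
    return "english"


def pronounce(
    word: str, classify: Callable[[str], str] = toy_classifier
) -> Tuple[str, List[str]]:
    say_as = classify(word)                # Step 1: say-as classification
    phonemes = G2P_BY_CLASS[say_as](word)  # Step 2: class-specific G2P rules
    return say_as, phonemes


if __name__ == "__main__":
    print(pronounce("NTT"))
    print(pronounce("shop"))
```

Keeping the classifier as a parameter of `pronounce` means a real model can replace the heuristic without touching the Step 2 dispatch table.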
Similar resources
Long vowel detection for letter-to-sound conversion for Japanese sourced words transliterated into the alphabet
Modern Japanese texts often include Western sourced words written in the Roman alphabet. Even Japanese sourced words are sometimes transliterated into the Roman alphabet. As most of them are very new and idiosyncratic proper nouns, it is impractical to assume all those alphabetic words can be registered in the word dictionary of a text-to-speech system; their pronunciation must be derived autom...
An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification
Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...
Orthographic Reading Deficits in Dyslexic Japanese Children: Examining the Transposed-Letter Effect in the Color-Word Stroop Paradigm
In orthographic reading, the transposed-letter effect (TLE) is the perception of a transposed-letter position word such as "cholocate" as the correct word "chocolate." Although previous studies on dyslexic children using alphabetic languages have reported such orthographic reading deficits, the extent of orthographic reading impairment in dyslexic Japanese children has remained unknown. This st...
High-Performance Bilingual Text Alignment Using Statistical and Dictionary Information
This paper describes an accurate and robust text alignment system for structurally different languages. Among structurally different languages such as Japanese and English, there is a limitation on the amount of word correspondences that can be statistically acquired. The proposed method makes use of two kinds of word correspondences in aligning bilingual texts. One is a bilingual dictionary of...
Acquired Dyslexia in Three Writing Systems: Study of a Portuguese-Japanese Bilingual Aphasic Patient
The Japanese language is represented by two different codes: syllabic and logographic while Portuguese employs an alphabetic writing system. Studies on bilingual Portuguese-Japanese individuals with acquired dyslexia therefore allow an investigation of the interaction between reading strategies and characteristics of three different writing codes. The aim of this study was to examine the differ...